132 research outputs found

    NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment

    This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis such as quality assessment and for generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD is a promising technique that mimics the human ability to assess audio quality against non-matching references, learning perceptual embeddings without the need for human-generated labels. Comment: 5 pages
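The two ingredients named in the abstract above, a triplet loss at training time and a Euclidean distance between embeddings at inference time, can be illustrated with a minimal sketch. This is not the NOMAD implementation: the function names, the margin value, and the toy two-dimensional embeddings are assumptions made for illustration, and the NSIM-guided choice of which sample counts as the positive or the negative is not modelled here.

```python
import math

def embedding_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(a, b)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss (margin value is an assumption).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = embedding_distance(anchor, positive)
    d_neg = embedding_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical toy embeddings, not taken from the paper.
anchor = [0.0, 0.0]
less_degraded = [3.0, 4.0]   # distance 5 from the anchor
more_degraded = [6.0, 8.0]   # distance 10 from the anchor

well_ordered = triplet_loss(anchor, less_degraded, more_degraded)
mis_ordered = triplet_loss(anchor, more_degraded, less_degraded)
print(well_ordered, mis_ordered)
```

With these toy values the correctly ordered triplet incurs no loss, while the swapped triplet is penalised, which is the behaviour a triplet objective relies on to order embeddings by degradation intensity.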

    Bounds on the Sum-Rate of MIMO Causal Source Coding Systems with Memory under Spatio-Temporal Distortion Constraints

    In this paper, we derive lower and upper bounds on the OPTA of a two-user multi-input multi-output (MIMO) causal encoding and causal decoding problem. Each user’s source model is described by a multidimensional Markov source driven by an additive i.i.d. noise process subject to three classes of spatio-temporal distortion constraints. To characterize the lower bounds, we use state augmentation techniques and a data processing theorem, which recovers a variant of the rate distortion function as an information measure known in the literature as nonanticipatory ϵ-entropy, or sequential or nonanticipative RDF. We derive lower bound characterizations for a system driven by an i.i.d. Gaussian noise process, which we solve using the SDP algorithm for all three classes of distortion constraints. We obtain closed-form solutions when the system’s noise is possibly non-Gaussian for both users and when only one of the users is described by a source model driven by a Gaussian noise process. To obtain the upper bounds, we use the best linear forward test channel realization, which corresponds to the optimal test channel realization when the system is driven by a Gaussian noise process, and apply a sequential causal DPCM-based scheme with a feedback loop followed by a scaled ECDQ scheme that leads to upper bounds with certain performance guarantees. Then, we use the linear forward test channel as a benchmark to obtain upper bounds on the OPTA when the system is driven by an additive i.i.d. non-Gaussian noise process. We support our framework with various simulation studies.

    Wavenet based low rate speech coding

    Traditional parametric coding of speech achieves a low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high-quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet-based coder and show that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker by a human listener, even when that speaker was not used during the training of the generative model. Comment: 5 pages, 2 figures

    Monitoring VoIP Speech Quality for Chopped and Clipped Speech


    AMBIQUAL – a Full Reference Objective Quality Metric for Ambisonic Spatial Audio

    Streaming spatial audio over networks requires efficient encoding techniques that compress the raw audio content without compromising quality of experience. Streaming service providers such as YouTube need a perceptually relevant objective audio quality metric to monitor users’ perceived quality and spatial localization accuracy. In this paper we introduce a full reference objective spatial audio quality metric, AMBIQUAL, which assesses both Listening Quality and Localization Accuracy. In our solution both metrics are derived directly from the B-format Ambisonic audio. The metric extends and adapts the algorithm used in ViSQOLAudio, a full reference objective metric designed for assessing speech and audio quality. In particular, Listening Quality is derived from the omnidirectional channel and Localization Accuracy is derived from a weighted sum of similarity from B-format directional channels. This paper evaluates whether the proposed AMBIQUAL objective spatial audio quality metric can predict two factors, Listening Quality and Localization Accuracy, by comparing its predictions with results from MUSHRA subjective listening tests. In particular, we evaluated the Listening Quality and Localization Accuracy of First and Third-Order Ambisonic audio compressed with the OPUS 1.2 codec at various bitrates (32 and 128 kbps for First-Order, and 256 and 512 kbps for Third-Order). The sample set for the tests comprised both recorded and synthetic audio clips with a wide range of time-frequency characteristics. To evaluate Localization Accuracy of compressed audio, a number of fixed and dynamic (moving vertically and horizontally) source positions were selected for the test samples. Results showed a strong correlation (PCC=0.919 and Spearman=0.882 for Listening Quality; PCC=0.854 and Spearman=0.842 for Localization Accuracy) between objective quality scores derived from the B-format Ambisonic audio using AMBIQUAL and subjective scores obtained during MUSHRA listening tests. AMBIQUAL displays very promising quality assessment predictions for spatial audio. Future work will optimise the algorithm to generalise and validate it for any Higher Order Ambisonic formats.
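The PCC and Spearman figures quoted above are the two usual ways of comparing objective scores with subjective listening-test scores. As a rough sketch of how such figures are computed, the pure-Python functions below implement the Pearson correlation and the Spearman rank correlation (Pearson applied to ranks, with tied values sharing their average rank). The function names and the sample scores are made up for illustration and are not data from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical objective and subjective scores, for illustration only.
objective = [0.42, 0.55, 0.61, 0.78, 0.90]
subjective = [35.0, 48.0, 44.0, 70.0, 88.0]
print(pearson(objective, subjective), spearman(objective, subjective))
```

Pearson measures how linear the relationship is, while Spearman only checks whether the two score lists order the samples the same way, which is why the two values are typically reported together.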